We consider the classical problem of estimating a vector $\mu = (\mu_1, \dots, \mu_n)$
based on independent observations $Y_i \sim N(\mu_i, 1)$, $i = 1, \dots, n$.

Suppose $\mu_i$, $i = 1, \dots, n$, are independent realizations from a completely
unknown $G$. We suggest an easily computed estimator $\hat\mu$ such that the
ratio of its risk $E(\hat\mu - \mu)^2$ to that of the Bayes procedure
approaches 1. A related compound decision result is also obtained.

Our asymptotics are those of a triangular array; that is, we allow the
distribution $G$ to depend on $n$. Thus, our theoretical asymptotic results
are also meaningful in situations where the vector $\mu$ is sparse and the
proportion of zero coordinates approaches 1.

We demonstrate the performance of our estimator in simulations, 
emphasizing sparse setups. In "moderately-sparse" situations, our 
procedure performs very well compared to known procedures tailored 
for sparse setups. It also adapts well to nonsparse situations. 

1. Introduction. Let $Y = (Y_1, \dots, Y_n)$ be a random normal vector where
$Y_i \sim N(\mu_i, 1)$, $i = 1, \dots, n$, are independent. Consider the classical
problem of estimating the mean vector $\mu = (\mu_1, \dots, \mu_n)$ by a
(nonrandomized) estimator $\Delta = \Delta(Y)$ under the squared-error loss
$L_n(\mu, \Delta) = \sum_i (\Delta_i - \mu_i)^2$. The corresponding risk function
is the expected squared error

$$R(\mu, \Delta) = E_\mu\bigl(L_n(\mu, \Delta(Y))\bigr).$$

Compound decision theory. A natural class of decision functions is the
class of simple symmetric estimators that was suggested by Robbins (1956).
This is the class of all estimators $\Delta^*$ of the form

$$\Delta^*(Y) = (\delta(Y_1), \dots, \delta(Y_n))$$

for some function $\delta$. For such an estimator, we will occasionally write
$\Delta^*(Y) = \Delta^*(Y \mid \delta)$ in order to show the dependence on $\delta$.
Given $\mu = (\mu_1, \dots, \mu_n)$, let

$$\delta^{*\mu} = \arg\min_{\delta} R(\mu, \Delta^*(\cdot \mid \delta))$$

and, for notational convenience, let $\Delta^{*\mu} = \Delta(\cdot \mid \delta^{*\mu})$.

Consider an oracle that knows the value of the vector $\mu$ but must use a
simple symmetric estimator. Such an oracle would use the estimator $\Delta^{*\mu}$.
The goal of compound decision theory is to achieve nearly the risk obtained
by such an oracle, but by using a "legitimate" estimator, one that may
involve the entire vector of observations $Y$ but does not involve knowledge of
the parameter vector $\mu$. In establishing specific results, it is important to be
suitably precise about the (asymptotic) sense in which this nearness is
measured. This will be discussed later, after introducing the companion concept
of empirical Bayes. For background on both compound decision and empirical
Bayes, see Robbins (1951, 1956, 1964), Samuel (1965), Copas (1969) and
Zhang (2003), among many other papers.

Empirical Bayes. Let $G$ be a prior distribution on $\mathbb{R}$. Let
$M = \{M_i, i = 1, \dots, n\}$ be an unobserved random sample from this distribution.
Conditional on the $\{M_i\}$, observe $Y_i \sim N(M_i, 1)$, $i = 1, \dots, n$, independent.
Here, the target procedure is the Bayes procedure, to be denoted $\Delta^G$. The goal
is to find a procedure $\Delta$ whose expected risk under $G$ is suitably near that
of $\Delta^G$ as $n \to \infty$, when $G$ is unknown. The notation here is intentionally
similar to that used previously for the compound decision problem, but note
that the superscript is now a distribution $G$, whereas, in the compound
decision situation, the superscript is a vector $\mu$ or, equivalently, the set of
coordinates of $\mu$.

Relation of compound and empirical Bayes risks. The expected average
risk under $G$ of a procedure $\Delta$ will be denoted by $B(G, \Delta)$. Note that

$$B(G, \Delta) = E_G\Bigl(\frac{1}{n} R(M, \Delta)\Bigr), \tag{1}$$

where we treat M as a random vector whose coordinates are a sample of 
size n from G, as described above. (For convenience, the dependence on n 
is suppressed in the notation.) 

Here are some simple consequences of this relation. Let $\{\Delta_n\}$ denote a
sequence of estimators in a sequence of problems with increasing dimension
$n$. Suppose, for example, that $\{\Delta_n\}$ has the basic asymptotic compound
Bayes property that, for every $\mu^n = (\mu_1^n, \dots, \mu_n^n)$, $n = 1, 2, \dots$,

$$\frac{1}{n} R(\mu^n, \Delta_n) - \frac{1}{n} R(\mu^n, \Delta^{*\mu^n}) \to 0. \tag{2}$$

Also, assume that $\frac{1}{n} R(\mu, \Delta_n)$ is uniformly bounded, as will typically be
the case under suitable assumptions [as in (35)]. It then follows from (1)
that $\{\Delta_n\}$ is asymptotically empirical Bayes in the basic sense that

$$B(G, \Delta^G) = E_G\Bigl(\frac{1}{n} R(M, \Delta^G)\Bigr) \ge E_G\Bigl(\frac{1}{n} R(M, \Delta^{*M})\Bigr)
= E_G\Bigl(\frac{1}{n} R(M, \Delta_n)\Bigr) + o(1) = B(G, \Delta_n) + o(1). \tag{3}$$

Hence, under very mild conditions, asymptotic compound optimality in the
sense of (2) implies the asymptotic empirical Bayes property in the sense of (3).

In Section 2, we will propose a particular, easily implemented form for $\Delta_n$.
In Section 3, we establish some more precise compound and empirical Bayes 
properties for this estimator. Although these properties are more demanding 
than (2) and (3), the relation (1) remains an important part of the arguments 
that establish them. 



Relation of compound optimal and empirical Bayes procedures. The
relation (1) describes a close connection between the compound Bayes and
empirical Bayes criteria. It is also true that the optimal procedures are
closely connected. The Bayes procedure $\Delta^G(Y) = (\delta^G(Y_1), \dots, \delta^G(Y_n))$, for
a specified prior $G$, is of course given by Bayes formula

$$\delta^G(Y_j) = E(M_j \mid Y) = E(M_j \mid Y_j)
= \frac{\int u\,\varphi(u - Y_j)\,G(du)}{\int \varphi(u - Y_j)\,G(du)}. \tag{4}$$

Among its other features, the Bayes formula (4) reveals that the Bayes
procedure is a simple symmetric estimator.

A simple derivation also yields the basic formula for $\Delta^{*\mu}$, through the
corresponding univariate function $\delta^{*\mu}$:

$$\delta^{*\mu}(u) = \frac{\sum_i \mu_i\,\varphi(\mu_i - u)}{\sum_i \varphi(\mu_i - u)}. \tag{5}$$

Given $\mu = (\mu_1, \dots, \mu_n)$, let $F_n^{\mu}$ denote the corresponding empirical CDF.
Then the formula for $\delta^{*\mu}$ can be rewritten as

$$\delta^{*\mu}(u) = \delta^{F_n^{\mu}}(u). \tag{6}$$

This formula provides a direct connection between the optimal estimators
for the two settings. This will be exploited in the construction, in Section 2,
of an asymptotically optimal estimator.
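
Formula (5) can be evaluated directly when $\mu$ is known. The following is a minimal sketch, assuming NumPy; the function name `oracle_rule` and the log-weight stabilisation are illustrative choices, not part of the original development.

```python
import numpy as np

def oracle_rule(u, mu):
    """Evaluate sum_i mu_i*phi(mu_i - u) / sum_i phi(mu_i - u), as in (5).

    `oracle_rule` is a hypothetical helper name used only for illustration.
    """
    u = np.atleast_1d(np.asarray(u, dtype=float))
    mu = np.asarray(mu, dtype=float)
    # Log of the normal kernel; the 1/sqrt(2*pi) factor cancels in the ratio.
    log_w = -0.5 * (u[:, None] - mu[None, :]) ** 2
    # Subtract the row maximum so that exponentiation cannot underflow to 0/0.
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return (w * mu[None, :]).sum(axis=1) / w.sum(axis=1)
```

By (6), the same function gives the Bayes rule under the prior that puts mass $1/n$ on each coordinate of $\mu$; the output always lies between the smallest and largest $\mu_i$.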

Sparse estimation problems. The following discussion is intended to help
motivate the asymptotic properties to be established in Section 3. It will also 
help motivate the choice of settings serving as the basis for the numerical 
results reported in Section 4. 

Many recent statistical results have focused on the importance of treat- 
ing situations involving "sparse" models [see, e.g., Donoho and Johnstone 
(1994), Johnstone and Silverman (2004) and Efron (2003)]. Many such prob- 
lems involve issues of testing hypotheses, but for others estimation is of 
secondary or even primary interest. The basic asymptotic empirical Bayes 
property in (3) involves asymptotic properties for a fixed prior G. Such a 
formulation is not sufficiently flexible to provide useful results in "sparse" 
settings. In Section 3, we investigate an asymptotic formulation that is ap- 
propriate for many sparse problems, as well as for the more conventional 
settings involving asymptotics for fixed (but unknown) G. 

"Sparsity" is not a precise statistical condition. However, the essence of 
many "sparse" settings is captured by considering situations in which most 
of the unknown coordinates $\mu_i$ take the value 0, and the remaining few take
other value(s).

To be precise, in the following discussion of the compound Bayes setting,
consider a situation in which the possible values for the coordinates
$\mu_i = \mu_i^n$ of $\mu^n$ are either $\mu_i = 0$ or $\mu_i = \mu_0 \ne 0$, $i = 1, \dots, n$.
Here, we consider a sequence of problems with increasing dimension $n$. For a
given $n$, let $p = p(n)$ denote the proportion of nonzero values. The situation
is sparse if $p(n) \to 0$ as $n \to \infty$. (For simplicity, assume that there is only
one possible nonzero value, $\mu_0$, and that this value does not change with $n$.
Of course many other situations are possible that should still be classed as
sparse models.) Then,

$$\frac{1}{n} R(\mu^n, \Delta^{*\mu^n}) = O(p(n)). \tag{7}$$

Note that

(8) [equation (8) is illegible in the source]

Hence, useful asymptotic results for sparse models must accommodate this
fact.

The asymptotic statements in Section 3 are naturally scaled to accommodate
sparsity in this way because they examine the relative risk ratio, rather
than the ordinary difference between average risks, as in the basic statement
(2). Thus, for the given sequence $\{\Delta_n\}$ of procedures defined in Section 2,
these results examine the limiting value of

$$\frac{R(\mu^n, \Delta_n) - R(\mu^n, \Delta^{*\mu^n})}{R(\mu^n, \Delta^{*\mu^n})} \tag{9}$$

and establish quite general conditions under which this ratio converges to
0. (The results of Section 3 include the preceding two-point model as a very
special case.)

Here is the empirical Bayes setting which corresponds to the special sparse
compound Bayes model described in the previous paragraphs. Consider an
empirical Bayes model, in which it is assumed that $G = G_n$, where

$$G_n(\{\mu_0\}) = \pi(n) = 1 - G_n(\{0\}). \tag{10}$$

Note that, as in (7),

$$B(G_n, \Delta^{G_n}) = O(\pi(n)).$$

Similar to (9), the asymptotic results appropriate for sparse models will be
phrased in terms of the limiting value of the relative difference

$$\frac{B(G_n, \Delta_n) - B(G_n, \Delta^{G_n})}{B(G_n, \Delta^{G_n})}. \tag{11}$$

The preceding discussion suggests that the degree of "sparsity" of a
sequence of compound or empirical Bayes models could be measured by the
asymptotic behavior of $R(\mu^n, \Delta^{*\mu^n})$ or $B(G_n, \Delta^{G_n})$, respectively.
For example, sequences of models for which

$$\liminf_{n} \frac{1}{n} R(\mu^n, \Delta^{*\mu^n}) > 0 \tag{12}$$

[or $\liminf B(G_n, \Delta^{G_n}) > 0$] could be considered nonsparse. At the other
extreme are sequences for which

$$R(\mu^n, \Delta^{*\mu^n}) = O(1); \tag{13}$$

those could be called extremely sparse. Sequences between those extremes
can be termed moderately sparse. A typical example could be a sequence of
problems for which

$$R(\mu^n, \Delta^{*\mu^n}) = O(n^{\alpha}), \qquad 0 < \alpha < 1. \tag{14}$$

Note that, in this description of sparseness, the zero value does not play 
a special role. It is the "complexity" of the sequence or the "difficulty to 
estimate it" that defines its sparseness. 

In Section 2, we construct an estimator that is approximately compound 
optimal and empirical Bayes. The construction formula is simple and easily 
implemented. This estimator performs very well for nonsparse and moder- 
ately sparse settings, such as those in (14). It can also be satisfactorily used 
for extremely sparse settings, but it is implicit in the theory in Section 3 
and explicit in the simulations in Section 4 that its performance is not quite 
optimal in some extremely sparse settings.

Permutation invariant procedures. A natural class of procedures, which
is larger than the class of simple symmetric ones, is the class of permutation
invariant procedures. This is the class of all procedures $\Delta$ that satisfy

$$\Delta(Y_1, \dots, Y_n) = (\hat\mu_1, \dots, \hat\mu_n) \iff
\Delta(Y_{\pi(1)}, \dots, Y_{\pi(n)}) = (\hat\mu_{\pi(1)}, \dots, \hat\mu_{\pi(n)})$$

for every permutation $\pi$.

In a recent paper by Greenshtein and Ritov (2008), a "strong equivalence" 
between the class of permutation invariant procedures and the class of simple 
symmetric procedures is shown. This equivalence implies that some of the 
optimality results we obtain, comparing the performance of our procedure 
with that of the optimal simple symmetric procedure for a given $\mu$, are valid
also with respect to the comparison with the (superior) optimal permutation 
invariant procedure. 

2. Bayes, empirical Bayes and compound decision. Let $Y \sim N(M, 1)$
where $M \sim G$, $G \in \mathcal{G}$. We want to emulate the Bayes procedure
$\delta^G \equiv \delta^G_1$, based on a sample $Y_1, \dots, Y_n$, $Y_i \sim N(M_i, 1)$,
$i = 1, \dots, n$, where $M_i \sim G$ and the $Y_i$ are independent conditional on
$M_1, \dots, M_n$. In general, $G$ may depend on $n$, but, in order to simplify the
notation and presentation, we consider a fixed $G$ throughout this section.
The generalization for a triangular array is easily accomplished.

Consider our problem for a general variance $\sigma^2$; that is, suppose
$Y_i \sim N(M_i, \sigma^2)$, $M_i \sim G$, $i = 1, \dots, n$. Let $g^*_{G,\sigma^2}$ be the mixture density

$$g^*_{G,\sigma^2}(y) = \int \frac{1}{\sigma}\,\varphi\Bigl(\frac{y - u}{\sigma}\Bigr)\,dG(u). \tag{15}$$

Then, from Brown (1971), (1.2.2), we have that the Bayes procedure, denoted
$\delta^G_{\sigma^2}$, satisfies

$$\delta^G_{\sigma^2}(y) = y + \sigma^2\,\frac{g^{*\prime}_{G,\sigma^2}(y)}{g^*_{G,\sigma^2}(y)}. \tag{16}$$

Here, $g^{*\prime}_{G,\sigma^2}(y)$ is the derivative of $g^*_{G,\sigma^2}(y)$.

The estimator that we suggest for $\delta^G_1$ is of the form

$$\hat\delta_h(y) = y + \frac{\hat g^{*\prime}_h(y)}{\hat g^*_h(y)}, \tag{17}$$

where $\hat g^*_h(y)$ and $\hat g^{*\prime}_h(y)$ are appropriate kernel estimators for the density
$g^*_{G,1}(y)$ and its derivative $g^{*\prime}_{G,1}(y)$. The subscript $h$ is the bandwidth for the
estimator. We will use a normal kernel. This choice is convenient from
several perspectives, but does not seem to be essential. See Remark 2 later in
this section. An alternative to kernel density estimators could be a direct
estimation of $G$. An approach involving maximum likelihood estimation of $G$
was recently suggested by Wenhua and Zhang (2007). Its performance in
simulations is excellent and it has appealing theoretical properties. However,
it is computationally intensive.

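Let $h > 0$ be a bandwidth constant. Typically, $h$ will depend on $n$, and
$\lim_{n\to\infty} h = 0$. Then, define the kernel estimator

$$\hat g^*_h(y) = \frac{1}{nh} \sum_{i=1}^{n} \varphi\Bigl(\frac{y - Y_i}{h}\Bigr). \tag{18}$$

Its derivative has the form

$$\hat g^{*\prime}_h(y) = \frac{1}{nh} \sum_{i=1}^{n} \frac{Y_i - y}{h^2}\,\varphi\Bigl(\frac{y - Y_i}{h}\Bigr). \tag{19}$$

Let

$$v = 1 + h^2.$$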
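
The kernel construction above reduces to a ratio of two weighted sums, because the factor $1/(nh)$ and the normalising constant of $\varphi$ cancel. The following is a minimal sketch of that computation, assuming NumPy; the name `npeb_estimate` is hypothetical, and the untruncated form $y + \hat g^{*\prime}_h(y)/\hat g^*_h(y)$ with $h^2 = v - 1$ is what is implemented.

```python
import numpy as np

def npeb_estimate(y_eval, y_obs, v):
    """Kernel estimate y + g'_h(y)/g_h(y) with a normal kernel, h^2 = v - 1.

    `npeb_estimate` is an illustrative name, not taken from the paper.
    """
    y_eval = np.atleast_1d(np.asarray(y_eval, dtype=float))
    y_obs = np.asarray(y_obs, dtype=float)
    h2 = v - 1.0
    if h2 <= 0:
        raise ValueError("v must be strictly larger than 1")
    diff = y_obs[None, :] - y_eval[:, None]        # Y_i - y
    log_k = -0.5 * diff ** 2 / h2                  # log kernel weights
    log_k -= log_k.max(axis=1, keepdims=True)      # stabilise the exponentials
    k = np.exp(log_k)
    # g'/g = sum_i ((Y_i - y)/h^2) K_i / sum_i K_i; common factors cancel.
    return y_eval + (k * diff).sum(axis=1) / (h2 * k.sum(axis=1))
```

Evaluating at the observations themselves, `npeb_estimate(y_obs, y_obs, v)`, gives one estimate per coordinate; this builds an `n` by `n` array, which is acceptable for `n` in the low thousands but would need chunking for much larger `n`.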

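The following simple lemma establishes that $\hat g^*_h$ and $\hat g^{*\prime}_h$ are unbiased
estimates of $g^*_{G,v}$ and $g^{*\prime}_{G,v}$. It also further interprets their form.

Let $G_n^Y$ denote the empirical distribution determined by $Y_1, \dots, Y_n$.

Lemma 1. Let $h > 0$ and $v = 1 + h^2$, and suppose $Y_i \sim N(M_i, 1)$, where
$M_i \sim G$ are independent. Then,

$$\hat g^*_h(y) = g^*_{G_n^Y, v-1}(y), \qquad \hat g^{*\prime}_h(y) = g^{*\prime}_{G_n^Y, v-1}(y), \tag{20}$$

$$E\,g^*_{G_n^Y, v-1}(y) = g^*_{G,v}(y), \qquad E\,g^{*\prime}_{G_n^Y, v-1}(y) = g^{*\prime}_{G,v}(y). \tag{21}$$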
Proof. We write

$$\hat g^*_h(y) = \int \frac{1}{h}\,\varphi\Bigl(\frac{y - t}{h}\Bigr)\,dG_n^Y(t) = g^*_{G_n^Y, h^2}(y),$$

since $h^{-1}\varphi(x/h)$ is the normal density with variance $h^2$. Let $\Phi_{\sigma^2}$ denote the
normal distribution with variance $\sigma^2$. Under the conditions of the lemma,
$E(G_n^Y) = G * \Phi_1$.

Hence, $E(\hat g^*_h(y)) = g^*_{G * \Phi_1, h^2}(y) = g^*_{G, 1+h^2}(y)$, since
$(G * \Phi_1) * \Phi_{h^2} = G * \Phi_{1+h^2}$.

The arguments for the derivatives follow by differentiation or by an
independent argument analogous to the above. This completes the proof. □

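Hence, the basic formula (17) may be rewritten as

$$\hat\delta_{1+h^2}(y) \equiv \hat\delta_v(y) = y + \frac{g^{*\prime}_{G_n^Y, h^2}(y)}{g^*_{G_n^Y, h^2}(y)}. \tag{22}$$

As a final step in the motivation of our estimator, note that

$$\delta^G_{1+h^2}(y) \equiv \delta^G_v(y) \to \delta^G_1(y) \quad \text{as } h \to 0. \tag{23}$$

By Lemma 1 and (23), we expect that, for large $n$ and $v = 1 + h^2 \approx 1$, we
have

$$\frac{g^{*\prime}_{G_n^Y, v-1}(y)}{g^*_{G_n^Y, v-1}(y)} \approx \frac{g^{*\prime}_{G,v}(y)}{g^*_{G,v}(y)} \approx \frac{g^{*\prime}_{G,1}(y)}{g^*_{G,1}(y)}. \tag{24}$$

Similarly, we have

$$\delta^G_1(y) = y + (\delta^G_1(y) - y) \approx y + \frac{g^{*\prime}_{G_n^Y, v-1}(y)}{g^*_{G_n^Y, v-1}(y)}
= y + \frac{1}{v-1}\,(v-1)\,\frac{g^{*\prime}_{G_n^Y, v-1}(y)}{g^*_{G_n^Y, v-1}(y)}
= y + \frac{1}{v-1}\bigl(\delta^{G_n^Y}_{v-1}(y) - y\bigr) = \hat\delta_v(y). \tag{25}$$

Here, $\delta^{G_n^Y}_{v-1}$ is as defined above (16).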

Remark 1. All the equations obtained so far for the empirical Bayes
setup have a parallel derivation and presentation in the compound decision
setup for a given $\mu = (\mu_1, \dots, \mu_n)$, where $F_n^{\mu}$, the empirical distribution of
$\mu = (\mu_1, \dots, \mu_n)$, plays the role of $G$, as in (6). For example, (15) has the
form $\frac{1}{n}\sum_i \frac{1}{\sigma}\varphi\bigl(\frac{y - \mu_i}{\sigma}\bigr)$, and the analog of $\delta^G_{\sigma^2}$ is denoted $\delta^{*\mu}_{\sigma^2}$, etc.

Example 1. It is of some interest to examine how the preceding
formulas compare in the standard case where the true prior is Gaussian, say
$G = N(0, \gamma^2)$. In that case, $G_n^Y \Rightarrow N(0, 1 + \gamma^2)$ in distribution. The actual
Bayes procedure is

$$\delta^G_1(Y) = \Bigl(1 - \frac{1}{1 + \gamma^2}\Bigr) Y,$$

while, by (25) for a fixed $v$, $\hat\delta_v(Y)$ converges as $n \to \infty$ to

$$\Bigl(1 - \frac{1}{v + \gamma^2}\Bigr) Y.$$

This may be seen by substituting
$\delta^{N(0,1+\gamma^2)}_{v-1}(y) = \bigl(1 - \frac{v-1}{v+\gamma^2}\bigr) y$ for $\delta^{G_n^Y}_{v-1}(y)$
in (25).

Thus, when letting $v = v_n$ approach 1 (equivalently, when letting the
bandwidth $h = \sqrt{v - 1}$ approach 0) as $n$ approaches infinity, we may see that
$\hat\delta_v(y)$ approaches $\delta^G_1(y)$.
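
The limit in Example 1 can be checked numerically. The following is a small simulation sketch, assuming NumPy; the values of `gamma`, `v`, `n`, the seed and the evaluation grid are arbitrary illustrative choices, and `kernel_rule` is a hypothetical helper implementing $y + \hat g^{*\prime}_h(y)/\hat g^*_h(y)$ with $h^2 = v - 1$.

```python
import numpy as np

def kernel_rule(y_eval, y_obs, v):
    """y + g'_h(y)/g_h(y) with a normal kernel and h^2 = v - 1."""
    y_eval = np.atleast_1d(np.asarray(y_eval, dtype=float))
    diff = np.asarray(y_obs, dtype=float)[None, :] - y_eval[:, None]  # Y_i - y
    h2 = v - 1.0
    w = np.exp(-0.5 * diff ** 2 / h2)   # unnormalised kernel weights
    return y_eval + (w * diff).sum(axis=1) / (h2 * w.sum(axis=1))

rng = np.random.default_rng(0)          # arbitrary seed
gamma, v, n = 2.0, 1.15, 50_000         # illustrative values, not from the paper
m = rng.normal(0.0, gamma, size=n)      # M_i ~ N(0, gamma^2)
y = m + rng.normal(size=n)              # Y_i | M_i ~ N(M_i, 1)

grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
estimate = kernel_rule(grid, y, v)
limit = (1.0 - 1.0 / (v + gamma ** 2)) * grid     # fixed-v limit from Example 1
bayes = (1.0 - 1.0 / (1.0 + gamma ** 2)) * grid   # exact Bayes rule
max_gap = float(np.max(np.abs(estimate - limit)))
```

With these settings the fixed-$v$ limit has slope $1 - 1/5.15$ while the Bayes rule has slope $0.8$, so the two differ only slightly; `max_gap` measures how far the kernel estimate is from the fixed-$v$ limit on the grid.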

Remark 2 (On the choice of a kernel). One could choose other kernels
and obtain corresponding different estimators. See, for example, the papers
of Zhang (1997, 2005). In those papers, Zhang introduces an estimator that
uses Fourier methods and corresponding kernels to estimate the above
$g^*_{G,1}$ and $g^{*\prime}_{G,1}$. Zhang's papers are very relevant, and there are similarities
between our approach and his earlier development.

We now point to some advantages of our kernel. One advantage is the
interpretation of $\hat\delta_v$ as an approximation for $\delta^G_v$. Here, $\delta^G_v(u)$ is the Bayes
decision function for the setup where $U \sim N(M, v)$ and $M \sim G$ (see, e.g.,
Example 1, with the interpretation of the obtained rules in terms of the
approximation $G_n^Y$ of $G$). This interpretation is very helpful in the proof
of Theorem 1. We are not sure to what extent a normal kernel is essential
to obtain the good performance of our estimator, but it certainly simplifies
various arguments. In addition, kernels with heavy tails would typically
introduce a significant bias when estimating $g^*_{G,1}$ and its derivative in the
tail.

3. Optimality in compound decision under sparsity. In this section, we
study asymptotics that are appropriate for both nonsparse and sparse
compound decision problems. The traditional asymptotics for empirical Bayes
and compound decision consider the difference in average risks between the
target (or optimal) procedure and a suggested estimator. In the sparse
setting, both of these quantities approach zero. So the traditional asymptotic
criteria are not informative, and a more delicate study is needed.

Our main result, Theorem 1, covers the compound decision framework. 
It has an analogous empirical Bayes formulation which is obtained as a 
corollary. 

The formal setup is that of a triangular array, where, at stage $n$, the
parameter space, denoted $\Theta^n$, is of dimension $n$. We use the notation
$\mu^n = (\mu_1^n, \dots, \mu_n^n) \in \Theta^n$.

For every $\varepsilon > 0$, we assume

$$|\mu_j^n| < C_n = o(n^{\varepsilon}), \qquad n = 1, 2, \dots, \quad j = 1, \dots, n. \tag{26}$$

Such configurations include the interesting cases where $C_n = O(\sqrt{\log(n)})$.
Those are interesting configurations in which the statistical task of
discriminating between signal and noise is neither too easy nor too hard.

As before, we observe a vector $(Y_1^n, \dots, Y_n^n)$, where the $Y_i^n$ are independently
distributed $N(\mu_i^n, 1)$. Consider the loss for estimating $\mu^n$ by $\hat\mu^n$,

$$L(\mu^n, \hat\mu^n) = \sum_{i=1}^{n} (\hat\mu_i^n - \mu_i^n)^2; \tag{27}$$

here $\hat\mu^n = (\hat\mu_1^n, \dots, \hat\mu_n^n)$.

In this section, we will introduce the following slight modification for
$\hat\delta_v(u)$, and will consider a truncated estimator which at stage $n$ is of the
form

$$\hat\delta^t_v(u) = \mathrm{sign}(\hat\delta_v(u)) \times \min(C_n, |\hat\delta_v(u)|). \tag{28}$$

Note that we chose to truncate $\hat\delta_v$ so that $|\hat\delta^t_v| \le C_n$. An alternate
truncation can be used that may be more desirable in practice. This involves
truncation of $\hat\delta_v(y) - y$, rather than $\hat\delta_v$. In this case, the truncation level can
be chosen independent of $C_n$. We write $\hat\delta_v(y) = y + (\hat\delta_v(y) - y) \equiv y + R$. Let
$\tilde R = \mathrm{sign}(R)\min(|R|, \sqrt{3\log(n)})$. The alternate truncated estimator is

$$y + \tilde R. \tag{29}$$

This estimator also satisfies the conclusion of Theorem 1 and our other
results. Minor modifications of the proofs are needed. Let
$\hat\Delta^t_v(Y) = \Delta^*(Y \mid \hat\delta^t_v)$ denote the simple symmetric estimator of $\mu^n$.
Recall $v = 1 + h^2$. We now state our main result.
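
The two truncations in (28) and (29) are simple clipping operations applied to an already computed untruncated estimate. The following sketch assumes NumPy; the function names and the argument `c_n` (standing for the bound $C_n$, which must be supplied by the user) are illustrative.

```python
import numpy as np

def truncate_estimate(delta, c_n):
    """Truncation (28): sign(delta) * min(C_n, |delta|)."""
    delta = np.asarray(delta, dtype=float)
    return np.sign(delta) * np.minimum(c_n, np.abs(delta))

def truncate_correction(y, delta, n):
    """Truncation (29): y + sign(R) * min(|R|, sqrt(3 log n)), with R = delta - y."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(delta, dtype=float) - y
    cap = np.sqrt(3.0 * np.log(n))      # truncation level, free of C_n
    return y + np.sign(r) * np.minimum(np.abs(r), cap)
```

The second form only needs the sample size `n`, which is why it may be more convenient in practice when no bound on the means is available.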

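Theorem 1. Consider a triangular array with $\Theta^n$, as above, and
sequences $\mu^n \in \Theta^n$ as in (26). Let $v = v_n \to 1$, $v > 1$, be any sequence
satisfying:

(i) $\frac{1}{v-1} = o(n^{\varepsilon'})$ for every $\varepsilon' > 0$.

(ii) $\log(n) = o\bigl(\frac{1}{v-1}\bigr)$.

Assume that, for some $\varepsilon > 0$ and $n_0$,

$$R(\mu^n, \Delta^{*\mu^n}) > n^{\varepsilon} \qquad \forall n > n_0. \tag{30}$$

Then,

$$\limsup \frac{R(\mu^n, \hat\Delta^t_v)}{R(\mu^n, \Delta^{*\mu^n})} = 1. \tag{31}$$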

Remark 3. Theorem 1 states that, in situations which are not too
advantageous for the oracle, so that its risk is of an order larger than $n^{\varepsilon}$ for
some $\varepsilon > 0$, we may asymptotically do as well as that oracle by letting $v$
approach 1 in the right way. Doing as well as the oracle means that the
ratio of the risks approaches 1. Note that some condition resembling (30) is
needed; if, for example, $\mu^n = (0, \dots, 0)$, $n = 1, 2, \dots$, then the corresponding
risk of the oracle is identically 0, and we can obviously not achieve such a
risk by our estimator.

Although the asymptotics in this section are motivated mainly by sparse
setups, the result in Theorem 1 is valid for any sequence $\mu^n$ satisfying (26)
and (30). Obtaining an estimator that performs well and adapts well to a
broad range of "sparseness"/"denseness" is the main achievement of this
paper. The simple, easily interpretable form of our estimator is an additional
useful feature.

Remark 4. There is an alternate form for the conclusion (31) that
avoids the necessity for an explicit assumption like (30). A minor additional
argument shows that in the statement of the theorem one can omit (30) and
replace the conclusion (31) by the conclusion

$$\limsup \frac{R(\mu^n, \hat\Delta^t_v)}{R(\mu^n, \Delta^{*\mu^n}) + A_n} \le 1 \tag{32}$$

for all sequences $\{A_n\}$ such that $A_n > n^{\varepsilon'}$ for some $\varepsilon' > 0$.

Remark 5 (On the choice of the bandwidth). The asymptotic result of
the theorem requires that $h_n \to 0$, but at a fairly slow rate. This slow rate is
needed in order to obtain the general conclusion in (31), assuming any value
of $\varepsilon$ in (30). However, when (30) holds for large values of $\varepsilon$ (e.g., the nonsparse
case with $\varepsilon = 1$), then smaller values of $h_n$ might be desirable and will have
some theoretical advantage. Our theoretical results suggest that $h_n^2$ should
converge to zero "just faster" than $1/\log(n)$; we recommend $h_n^2 = 1/\log(n)$
as a "practical default choice." This choice was studied in our simulations
and also in Brown (2008) and Greenshtein and Park (2007), where real data
sets are explored.

One could improve by selecting different values of the bandwidth for different
points $y$ in an adaptive manner. Obviously, smaller bandwidths are desirable
in the "main body" of the distribution and bigger ones in the tail. Also, one
could use different bandwidths when estimating the density and its derivative
at a point (typically a larger bandwidth for estimating the derivative).
Such an approach (and possible improvement) would introduce computational
complexity to our simply computed estimator. We do not pursue this
approach in the present manuscript.

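From now on, we will occasionally drop the superscripts $t$ in $\hat\delta^t_v$, and $n$
on $\mu^n$. Recall the notation $\delta^{*\mu}_v$ for the optimal simple symmetric function
given $\mu$, when the observations are distributed $N(\mu_i, v)$. Thus, $\delta^{*\mu} \equiv \delta^{*\mu}_1$.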

Write 

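Theorem 1 will follow when we prove the following two lemmas and apply
Cauchy-Schwarz.

Lemma 2. For $v = v_n > 1$, such that $\log(n) = o\bigl(\frac{1}{v-1}\bigr)$, and $\mu^n \in \Theta^n$ as
in (26),

$$\lim_{n\to\infty} \frac{E\sum_i \bigl(\delta^{*\mu^n}_v(Y_i) - \mu_i\bigr)^2}{E\sum_i \bigl(\delta^{*\mu^n}_1(Y_i) - \mu_i\bigr)^2} = 1. \tag{33}$$

Proof. See Appendix. □

Lemma 3. Let $\varepsilon > 0$ (arbitrarily small). Suppose that $v = v_n > 1$ satisfies
$\frac{1}{v-1} = o(n^{\varepsilon'})$ for every $\varepsilon' > 0$, and $\mu^n \in \Theta^n$ as in (26). Then,

$$E\sum_i \bigl(\hat\delta^t_v(Y_i) - \delta^{*\mu^n}_v(Y_i)\bigr)^2 = o(n^{\varepsilon}). \tag{34}$$

Proof. See Appendix. □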



A result analogous to Theorem 1 for the empirical Bayes setup is obtained
as a corollary. Consider a triangular array where at stage $n$, we observe
$Y_i^n \sim N(M_i^n, 1)$, $M_i^n \sim G_n$, $i = 1, \dots, n$; the $M_i^n$ are independent and the
$Y_i^n$ are independent conditional on $M_i^n$, $i = 1, \dots, n$; the $G_n$ are unknown.
Assume that $G_n$ has support on $(-C_n, C_n)$, where

$$C_n = o(n^{\varepsilon'}) \tag{35}$$

for every $\varepsilon' > 0$. Let $\delta^{G_n}$ be the sequence of Bayes procedures. In the following
corollary, the expectation is taken with respect to the joint distribution
of $(M_1^n, Y_1^n), \dots, (M_n^n, Y_n^n)$.

Corollary 1. Let $\varepsilon > 0$ (arbitrarily small). For any sequence $v = v_n > 1$,
such that:

(i) $\frac{1}{v-1} = o(n^{\varepsilon'})$ for every $\varepsilon' > 0$,

(ii) $\log(n) = o\bigl(\frac{1}{v-1}\bigr)$,

$$\limsup \frac{E\sum_i \bigl(\hat\delta^t_v(Y_i^n) - M_i^n\bigr)^2}{E\sum_i \bigl(\delta^{G_n}(Y_i^n) - M_i^n\bigr)^2 + n^{\varepsilon}} \le 1. \tag{36}$$

Proof. The corollary is obtained by conditioning on every possible
realization $M^n = (M_1^n, \dots, M_n^n)$ and applying Theorem 1 coupled with Remark
4 on each realization, treating the conditional setup as a compound decision
problem. The proof follows, since, by definition, for every $M^n$ (treated as a
fixed vector),
$E\sum_i \bigl(\delta^{*M^n}(Y_i^n) - M_i^n\bigr)^2 \le E\sum_i \bigl(\delta^{G_n}(Y_i^n) - M_i^n\bigr)^2$. □



Remark 6. Assuming the more restrictive condition $C_n = \sqrt{K\log(n)}$,
for some $K$, a careful adaptation of our proof will yield the conclusion of
Theorem 1 under the weaker assumption that $R(\mu^n, \Delta^{*\mu^n})$ is a suitable power
of $\log(n)$.

4. Simulations. This section will demonstrate the performance of our
method in a range of settings. As explained, the value of $v$ should decrease
as $n$ increases and should be chosen bigger than 1 but close to 1. We used
$v = 1.15$ in simulations with $n = 1000$, $v = 1.1$ when $n = 10{,}000$ and $v = 1.05$
when $n = 100{,}000$. No attempt was made to optimize $v$. Note that our default
recommendation, $v = 1 + 1/\log(n)$, equals 1.144 and 1.108 for $n = 1000$
and $n = 10{,}000$, respectively, roughly matching our choices. For
$n = 100{,}000$, we chose $v = 1.05$ rather than 1.086 in order to keep a gap of
0.05. However, small changes (say, taking $v = 1.15$ rather than $v = 1.1$) did
not have much of an effect.
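
The default choice quoted above is a one-line computation. The following sketch uses only the Python standard library; `default_v` is a hypothetical helper name, and the natural logarithm is assumed, which is what reproduces the values quoted in the text.

```python
import math

def default_v(n):
    """Default variance parameter v = 1 + 1/log(n), i.e. h^2 = 1/log(n)."""
    if n <= 1:
        raise ValueError("n must exceed 1 so that log(n) > 0")
    return 1.0 + 1.0 / math.log(n)

# Sample sizes used in the simulations of this section.
defaults = {n: default_v(n) for n in (1_000, 10_000, 100_000)}
```

The resulting values agree with the 1.144, 1.108 and 1.086 quoted in the text up to rounding in the third decimal.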

In Table 1 of Johnstone and Silverman (2004) [cited below as J-S (2004)],
the performances of eighteen estimation methods were compared in various
setups and configurations. Those methods include soft and hard universal
thresholds and others. The performance was compared in terms of the
expected squared risk. In all the configurations, the dimension of the vector $\mu$
is $n = 1000$ [i.e., $Y_i \sim N(\mu_i, 1)$, $i = 1, \dots, 1000$]. In four configurations, there
are $k = 5$ nonzero signals and these nonzero signals all take the value 3,
or all take the value 4, 5 or 7, respectively. A similar study was done
when there are $k = 50$ nonzero signals, and $k = 500$ nonzero signals; the
values of the nonzero signals are as before.

In the second line of the following Table 1, we show the performance of
the best among the eighteen methods in each case (i.e., the performance of
the method with minimal simulated risk for the specific configuration). The
first line shows the performance of our $\tilde\delta_v$ with $v = 1.15$. The empirical
risk of our procedure is based on averaging the results of 50
simulations in each configuration. One can see that the empirical risk of our
procedure is lower than the minimum of all the others in the nonsparse case
and in the moderately sparse case. Our procedure adapts particularly well
in the nonsparse case. Our method does not do that well in the extremely
sparse case; it is worse than the various empirical Bayes procedures suggested
in Johnstone and Silverman's paper, but it is within the range of the other
methods. All entries, in this table and those to follow, are rounded to the
nearest integer.



Table 1

Risk of $\tilde\delta_{1.15}$ compared to that of the best procedure in J-S (2004);
$n = 1000$ (average of 50 simulations rounded to the nearest integer)

k                       5    5    5    5   50   50   50   50  500  500  500  500
Signal value            3    4    5    7    3    4    5    7    3    4    5    7
$\tilde\delta_{1.15}$  53   49   42   27  179  136   81   40  484  302  158   48
Minimum                34   32   17    7  201  156   95   52  829  730  609  505

Note that, when the risk of the oracle is very small, our Theorem 1 does
not imply that we are doing well with respect to the oracle in terms of risk
ratios.

Here, $\tilde\delta_v$ denotes the following minor adaptation of $\hat\delta_v$:

$$\tilde\delta_v(y) = y + v\,\frac{g^{*\prime}_{G_n^Y, v-1}(y)}{g^*_{G_n^Y, v-1}(y)}.$$

The difference, relative to $\hat\delta_v$, is the multiplication of the ratio by $v$. As
$v \to 1$ the difference between the two estimators is negligible. The procedure
$\tilde\delta_v$ seems more suitable for approximating $\delta^{*\mu}$ and is as appealing as $\hat\delta_v$.

In the following Table 2, we report on the behavior of our procedure
based on 50 simulations in each of the following three configurations. The
dimension is $n = 10{,}000$ and there are $k = 100$, $k = 300$ and $k = 500$ nonzero
signals. The nonzero signals are selected by simulation, uniformly between
$-3$ and 3, independently in each simulation.

We compare the performance of our procedure with a hard threshold
Strong Oracle, whose loss per sample is

$$\min_{C} \sum_i \bigl[(Y_i - \mu_i)^2\, I(|Y_i| > C) + (0 - \mu_i)^2\, I(|Y_i| \le C)\bigr].$$

Thus the Strong Oracle applies the best hard threshold per realization. The
entries in Table 2 are based on the average of 50 simulations.
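
The Strong Oracle loss can be computed exactly for one realization by scanning the candidate thresholds, since only the ordering of the $|Y_i|$ matters. The following is a sketch assuming NumPy; `strong_oracle_loss` is a hypothetical name, and the code assumes no ties among the $|Y_i|$, which holds with probability one for continuous data.

```python
import numpy as np

def strong_oracle_loss(y, mu):
    """Minimum over C of sum_i (Y_i - mu_i)^2 I(|Y_i| > C) + mu_i^2 I(|Y_i| <= C).

    Assumes the |Y_i| are distinct, so every threshold zeroes a prefix of the
    observations sorted by absolute value.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    order = np.argsort(np.abs(y))
    keep_err = ((y - mu) ** 2)[order]   # error if coordinate is kept as Y_i
    kill_err = (mu ** 2)[order]         # error if coordinate is set to 0
    # Loss when the j smallest |Y_i| are zeroed, for j = 0, ..., n.
    kill_cum = np.concatenate(([0.0], np.cumsum(kill_err)))
    keep_tail = np.concatenate((np.cumsum(keep_err[::-1])[::-1], [0.0]))
    return float((kill_cum + keep_tail).min())
```

Sorting dominates the cost, so the scan is $O(n \log n)$ per realization; averaging the result over independent realizations gives a simulated Strong Oracle risk.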

We see that the Strong Oracle dominates our procedure in the very sparse
case where $k = 100$. Our procedure dominates in the less sparse cases.

In the following Table 3, we report on the behavior of our procedure, based
on 50 simulations in each of the following configurations. The dimension is
$n = 100{,}000$ with $k$ nonzero signals, and each has the value 4. The
simulations are performed for $k = 500$, $k = 1000$ and $k = 5000$. The comparison is
again with a Strong Oracle (SO). In our procedure we let $v = 1.05$. Note that our
procedure dominates the SO in each case. Our procedure thus appears more
advantageous as the dimension increases and there are more observations
available to estimate $\delta^{*\mu}$.







Table 2

Risk of $\tilde\delta_{1.1}$ compared with that of a Strong Oracle; $n = 10{,}000$

             $\tilde\delta_{1.1}$      SO
k = 100             306               295
k = 300             748               866
k = 500            1134              1430

Table 3

Risk of $\tilde\delta_{1.05}$ compared with that of a Strong Oracle; $n = 100{,}000$

             $\tilde\delta_{1.05}$     SO
k = 500            2410              3335
k = 1000           3810              5576
k = 5000         10,400            16,994


APPENDIX 



Proof of Lemma 2. Let $r_{v,n}$ be the risk corresponding to $\delta^{*\mu}_v(u)$ when
applied on an independent sample $U_i^n \sim N(\mu_i^n, v)$, and let $r_{1,n}$ be the risk of
$\delta^{*\mu}_1(y)$ when applied on an independent sample $Y_i^n \sim N(\mu_i, 1)$, $i = 1, \dots, n$.
Then,

$$E_{\mu^n} \sum_i \bigl(\delta^{*\mu}_v(U_i^n) - \mu_i^n\bigr)^2 = r_{v,n}, \tag{37}$$

$$E_{\mu^n} \sum_i \bigl(\delta^{*\mu}_1(Y_i^n) - \mu_i^n\bigr)^2 = r_{1,n}. \tag{38}$$

We will omit the superscript $n$ in the following.

Obviously, $r_{v,n} \ge r_{1,n}$, since the experiment $Y_i$, $i = 1, \dots, n$, dominates the
experiment $U_i$, $i = 1, \dots, n$, in terms of comparison of experiments. We will
first show that for $v = 1 + (1/d_n)$ where $\log(n) = o(d_n)$,

$$r_{1,n}/r_{v,n} \to 1. \tag{39}$$
Let $\varphi(u, \mu_i; s^2)$ denote the normal density with variance $s^2$ and mean $\mu_i$.
For every $\varepsilon' > 0$, we have

$$r_{v,n} = E\sum_i \bigl(\delta^{*\mu}_v(U_i) - \mu_i\bigr)^2 \le E\sum_i \bigl(\delta^{*\mu}_1(U_i) - \mu_i\bigr)^2
= \sum_i \int \bigl(\delta^{*\mu}_1(u) - \mu_i\bigr)^2 \varphi(u, \mu_i; v)\,du$$

$$= \sum_i \int \bigl(\delta^{*\mu}_1(u) - \mu_i\bigr)^2 \varphi(u, \mu_i; 1)\,\frac{\varphi(u, \mu_i; v)}{\varphi(u, \mu_i; 1)}\,du \tag{40}$$

$$= (1 + o(1)) \times E\sum_i \bigl(\delta^{*\mu}_1(Y_i) - \mu_i\bigr)^2 + o(n^{\varepsilon'}) \tag{41}$$

$$= (1 + o(1)) \times r_{1,n} + o(n^{\varepsilon'}). \tag{42}$$

Equation (41) is implied as follows. When $d_n/\log(n) \to \infty$, then, for each
summand $i$ in (40), the ratio of the densities approaches 1 uniformly on
the range where $|u - \mu_i| < \sqrt{K\log(n)}$ for any $K$. It is also easy to check
for each summand that, for large enough $K$, the integral over
$|u - \mu_i| > \sqrt{K\log(n)}$ can be made of the order $o(n^{\varepsilon'})$ for any $\varepsilon'$. This may be
seen since $|\delta^{*\mu}|$ and $|\mu_i|$ are bounded by $C_n$, while by choosing $K$ large enough
$n \times (2C_n)^2 \times P(|Y_i - \mu_i| > \sqrt{K\log(n)})$ can be made of order $o(n^{\varepsilon'})$.

By (30), letting $\varepsilon' < \varepsilon$, (42) implies $\limsup r_{v,n}/r_{1,n} \le 1$. This completes
the proof of (39), since, as mentioned, $r_{1,n} \le r_{v,n}$.
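
The uniform convergence of the density ratio used for (41) can be verified by a direct computation. The following worked step is a sketch under the assumptions already stated in the proof, namely $v = 1 + 1/d_n$ and $|u - \mu_i| \le \sqrt{K\log(n)}$. Writing out the two normal densities,

$$\frac{\varphi(u, \mu_i; v)}{\varphi(u, \mu_i; 1)} = v^{-1/2} \exp\Bigl(\frac{(u - \mu_i)^2}{2}\cdot\frac{v - 1}{v}\Bigr),
\qquad \frac{v-1}{v} = \frac{1}{d_n + 1},$$

so that on the stated range

$$v^{-1/2} \le \frac{\varphi(u, \mu_i; v)}{\varphi(u, \mu_i; 1)} \le v^{-1/2} \exp\Bigl(\frac{K\log(n)}{2(d_n + 1)}\Bigr).$$

Both bounds tend to 1 when $d_n/\log(n) \to \infty$, and neither depends on $u$ or $\mu_i$, which gives the uniformity.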

Similarly to the above, we write

$$E\sum_i \bigl(\delta^{*\mu}_v(Y_i) - \mu_i\bigr)^2 = \sum_i \int \bigl(\delta^{*\mu}_v(t) - \mu_i\bigr)^2 \varphi(t, \mu_i; 1)\,dt. \tag{43}$$

An argument similar to the above (yet easier) implies that for
$d_n/\log(n) \to \infty$ we have

$$\frac{E\sum_i \bigl(\delta^{*\mu}_v(Y_i) - \mu_i\bigr)^2}{r_{v,n}} \to 1. \tag{44}$$

Lemma 2 now follows from (39) and (44).

Note that Lemma 2 would follow along the same lines if we assume in
(30) the weaker condition $R(\mu^n, \Delta^{*\mu}) = O(1)$ (i.e., under our notion of an
extremely sparse setup). □

Proof of Lemma 3. In order to motivate the expression in (46), below,
we begin by comparing the performance of $\delta_v^{*\mu}$ and $\hat\delta_v$ when applied
on a set of new independent observations $\tilde Y_i^n \sim N(\mu_i^n, 1)$, $i = 1, \ldots, n$, which
are also independent of the set $Y_i^n$, $i = 1, \ldots, n$, that was used to obtain the
estimate $\hat\delta_v$. We will omit the superscript $n$ in the following. Thus, we first
show that

(45) $E_\mu \sum_i (\delta_v^{*\mu}(\tilde Y_i) - \hat\delta_v(\tilde Y_i))^2 = o(n^{\epsilon})$

for every $\epsilon > 0$.
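Observe that

$E_\mu \sum_i (\delta_v^{*\mu}(\tilde Y_i) - \hat\delta_v(\tilde Y_i))^2 = E_\mu \sum_i \int (\delta_v^{*\mu}(y) - \hat\delta_v(y))^2 \phi(y - \mu_i)\,dy$

(46) $= n \int_{-\infty}^{\infty} E_\mu[(\delta_v^{*\mu}(y) - \hat\delta_v(y))^2]\, g^*_{G,1}(y)\,dy$.

Here, $g^*_{G,1}(y) = \frac{1}{n}\sum_i \phi(y - \mu_i)$ denotes the mixture density in the compound
decision setup, where $G$ corresponds to the empirical distribution of
$(\mu_1, \ldots, \mu_n)$ (see Remark 1, Section 2). In fact, $G = G^n_\mu$, but the dependence
of $G$ on $n$ and $\mu$ is suppressed in the notation.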

The outline of the proof, that (46) is of order $o(n^{\epsilon})$ for every $\epsilon > 0$, is as
follows. Let $0 < \epsilon' < \epsilon$. Let $\mathcal{R}$ consist of all points $y$ that satisfy:
(i) $-C'_n < y < C'_n$, where $C'_n = (\log(n) + C_n)$
and,

(ii) $g^*_{G,1}(y) > n^{\epsilon'-1}$ for some $0 < \epsilon' < \epsilon$.
We then show that, uniformly for $y_0 \in \mathcal{R}$,

(47) $E[(\delta_v^{*\mu}(y_0) - \hat\delta_v(y_0))^2] = \frac{o(n^{\epsilon'})}{n\, g^*_{G,1}(y_0)}$.



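Once (47) has been verified, the proof of (45) can be completed, since

$n \int_{-\infty}^{\infty} E[(\delta_v^{*\mu}(y) - \hat\delta_v(y))^2]\, g^*_{G,1}(y)\,dy$

(48) $= n \int_{\mathcal{R}} E[(\delta_v^{*\mu}(y) - \hat\delta_v(y))^2]\, g^*_{G,1}(y)\,dy$

$\quad + n \int_{y \notin [-C'_n, C'_n]} E[(\delta_v^{*\mu}(y) - \hat\delta_v(y))^2]\, g^*_{G,1}(y)\,dy$

$\quad + n \int_{[-C'_n, C'_n] \setminus \mathcal{R}} E[(\delta_v^{*\mu}(y) - \hat\delta_v(y))^2]\, g^*_{G,1}(y)\,dy$

(49) $= o(n^{\epsilon}) + n \int_{\mathcal{R}} \frac{o(n^{\epsilon'})}{n\, g^*_{G,1}(y)}\, g^*_{G,1}(y)\,dy$

$= o(n^{\epsilon}) + o(C'_n n^{\epsilon'}) = o(n^{\epsilon})$.

In the above, we use the exponential tail of the normal distribution, the
truncation, and the upper bound $C_n$ for the elements of $\mu^n$.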
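
The three ingredients just named can be matched to the three regions of integration. The bounds below are a sketch, not taken from the original text. They read the decomposition as integrals over $\mathcal{R}$, over $y \notin [-C'_n, C'_n]$, and over the rest of $[-C'_n, C'_n]$, take the threshold in (ii) to be $n^{\epsilon'-1}$, and use the truncation bound $(\delta_v^{*\mu} - \hat\delta_v)^2 \le 4C_n^2$. The last step assumes $C_n = o(n^{\eta})$ for every $\eta > 0$.

```latex
% Region R: apply (47); R has length at most 2 C'_n
\[
n\int_{\mathcal{R}} \frac{o(n^{\epsilon'})}{n\,g^*_{G,1}(y)}\,g^*_{G,1}(y)\,dy
  \;\le\; 2C'_n\,o(n^{\epsilon'}).
\]
% Outside [-C'_n, C'_n]: |mu_i| <= C_n and C'_n = log(n) + C_n
\[
n\cdot 4C_n^2\int_{|y|>C'_n} g^*_{G,1}(y)\,dy
  \;\le\; 4nC_n^2\,P\bigl(|Z|>\log(n)\bigr)
  \;\le\; 8nC_n^2\,e^{-\log^2(n)/2},\qquad Z\sim N(0,1).
\]
% Rest of [-C'_n, C'_n]: the density is at most n^{eps'-1} there
\[
n\cdot 4C_n^2\cdot n^{\epsilon'-1}\cdot 2C'_n \;=\; 8\,C_n^2\,C'_n\,n^{\epsilon'}.
\]
```

Under the stated growth assumption on $C_n$, each of the three bounds is $o(n^{\epsilon})$ because $\epsilon' < \epsilon$.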

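We elaborate now on the derivation of (47). The (nontruncated) version
of our estimator equals

(50) $\hat\delta_v(y) = y + v\,\frac{\hat g^{*\prime}_h(y)}{\hat g^*_h(y)}$,

where $\hat g^{*\prime}_h$ and $\hat g^*_h$ are kernel density estimators, based on $Y_1, \ldots, Y_n$, of $g^{*\prime}_{G,1}$
and $g^*_{G,1}$ with bandwidth $h = \sqrt{v - 1} = \sqrt{1/d_n}$.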
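
To make the construction concrete, here is a minimal Python sketch of a kernel-based estimator of this type. It is an illustration under stated assumptions rather than the authors' implementation: it assumes a Gaussian kernel, assumes the plug-in form $y + v\,\hat g'/\hat g$ suggested by the Bayes rule recalled in (51), and truncates to $[-C_n, C_n]$ only when a bound is supplied. The function names `normal_pdf`, `kernel_estimates` and `delta_hat` are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def kernel_estimates(y, sample, h):
    """Gaussian-kernel estimates of the mixture density and its derivative at y."""
    n = len(sample)
    dens = [normal_pdf(y, yi, h * h) for yi in sample]
    g_hat = sum(dens) / n
    # d/dy of the N(yi, h^2) density at y equals -(y - yi) / h^2 times the density
    dg_hat = sum(-(y - yi) / (h * h) * d for yi, d in zip(sample, dens)) / n
    return g_hat, dg_hat

def delta_hat(y, sample, d_n, c_n=None):
    """Assumed plug-in rule y + v * g'/g with v = 1 + 1/d_n and h = sqrt(v - 1)."""
    v = 1.0 + 1.0 / d_n
    h = math.sqrt(v - 1.0)
    g_hat, dg_hat = kernel_estimates(y, sample, h)
    if g_hat == 0.0:
        # Density estimate underflowed: fall back to y (a convention of this sketch only)
        est = y
    else:
        est = y + v * dg_hat / g_hat
    if c_n is not None:
        # Optional truncation to [-c_n, c_n]
        est = max(-c_n, min(c_n, est))
    return est
```

For example, `delta_hat(0.0, [-1.0, 1.0], 4.0)` returns `0.0` by symmetry, since the two kernel contributions to the derivative estimate cancel.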
Recall that

(51) $\delta_v^{*\mu}(y) = y + v\,\frac{g^{*\prime}_{G,v}(y)}{g^*_{G,v}(y)}$.
We now write the right-hand side of (50) as

(52) $\hat\delta_v(y) = y + v\,\frac{g^{*\prime}_{G,v}(y) + R_1}{g^*_{G,v}(y) + R_2}$.

Here, the random variables $R_1$ and $R_2$ are implicitly defined, by comparing
numerators and denominators of (52) and (50), respectively.
Note that, by Lemma 1,

(53) $E(R_i) = 0$, $i = 1, 2$,

and

(54) $E[\delta_v^{*\mu}(y_0) - \hat\delta_v(y_0)]^2 = O\left(E\left(\frac{R_1}{g^*_{G,v}(y_0) + R_2}\right)^2 + E\left(\frac{C_n R_2}{g^*_{G,v}(y_0) + R_2}\right)^2\right)$

$= O\left(E\left(\frac{R_1}{g^*_{G,1}(y_0) + R_2}\right)^2 + E\left(\frac{C_n R_2}{g^*_{G,1}(y_0) + R_2}\right)^2\right)$.

For the last equality, we use the fact that $g^*_{G,1}(y)/g^*_{G,v}(y)$ is bounded when
$v > 1$; for the previous one, we use the fact that $g^{*\prime}_{G,v}(y_0)/g^*_{G,v}(y_0) = O(C_n)$,
uniformly for $y_0 \in \mathcal{R}$.
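
One way to see why the remainder $R_2$ in (53) has mean zero is the standard convolution fact that a Gaussian kernel of bandwidth $h$ averaged over $Y \sim N(\mu, 1)$ gives the $N(\mu, 1 + h^2)$ density, so that $h = \sqrt{v - 1}$ yields variance exactly $v$. The Python sketch below checks this numerically with the trapezoid rule. It is a sanity check, not part of the original proof, and the helper names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def expected_kernel(y, mu, h, half_width=12.0, steps=20000):
    """E[K_h(y - Y)] for Y ~ N(mu, 1) with a Gaussian kernel, by the trapezoid rule."""
    step = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        t = mu - half_width + k * step
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(y, t, h * h) * normal_pdf(t, mu, 1.0)
    return total * step

d_n = 4.0
h = math.sqrt(1.0 / d_n)   # bandwidth sqrt(v - 1) with v = 1 + 1/d_n
v = 1.0 + h * h
gap = abs(expected_kernel(1.3, 0.4, h) - normal_pdf(1.3, 0.4, v))
```

Averaging over the $\mu_i$ then shows that the kernel density estimate has expectation $g^*_{G,v}(y)$ at every point $y$; the same argument applies to the derivative estimate.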

The assertion $E[(\delta_v^{*\mu}(y_0) - \hat\delta_v(y_0))^2] = \frac{o(n^{\epsilon'})}{n\, g^*_{G,1}(y_0)}$ will be implied by computing
the variances of $R_i$, $i = 1, 2$, and by controlling the moderate deviation
of $R_2$, as in what follows.

The variances of $R_1$ and $R_2$ equal the variances of the corresponding
kernel density estimators in (50), of the density and its derivative. It may
be checked, from (18) and (19), that

(55) $\mathrm{var}(R_i) = O\left(\frac{(C_n d_n)^2\, g^*_{G,1}(y_0)}{n}\right) = \frac{o(n^{\epsilon'})\, g^*_{G,1}(y_0)}{n}$

for every $0 < \epsilon < \epsilon'$.

Applying Bernstein's inequality [see, e.g., van der Vaart and Wellner 
(1996), page 103] we obtain 

(56) P(R 2 < -0.5^,i (y )) < 1/4C* 

Since $[\hat\delta_v(y_0) - \delta_v^{*\mu}(y_0)]^2 < 4C_n^2$ by truncation, (47) follows when incorporating
the above computed values of the second moments of $R_i$, $i = 1, 2$, into
the numerator of (54) and controlling its denominator by (56).
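
To illustrate how Bernstein's inequality controls the denominator, the Python sketch below evaluates the one-sided bound $P(S \le -x) \le \exp\{-\tfrac12 x^2/(V + Mx/3)\}$ for a sum $S$ of independent centred terms with $|{\rm term}| \le M$ and $\mathrm{Var}(S) \le V$, applied to a Gaussian kernel density estimate. It is an illustration under assumptions, not the computation in the original proof: the argument `g` stands for the mean of the kernel estimate at $y_0$, the variance bound $\mathrm{Var}(K) \le \sup(K)\,E(K)$ is a crude choice made here, and the function names are hypothetical.

```python
import math

def bernstein_tail(x, var_sum, m_bound):
    """Bernstein bound on P(S >= x), equivalently on P(S <= -x), for a sum S of
    independent centred terms with |term| <= m_bound and Var(S) <= var_sum."""
    return math.exp(-0.5 * x * x / (var_sum + m_bound * x / 3.0))

def kernel_lower_tail_bound(n, g, d_n):
    """Bound on P(kernel density estimate falls below its mean g by more than g/2)."""
    h = math.sqrt(1.0 / d_n)
    m_bound = 1.0 / (h * math.sqrt(2.0 * math.pi))  # supremum of the Gaussian kernel
    var_sum = n * m_bound * g    # Var(K) <= E[K^2] <= sup(K) * E[K]
    x = 0.5 * n * g              # a drop of g/2 in the average is n*g/2 in the sum
    return bernstein_tail(x, var_sum, m_bound)

bound = kernel_lower_tail_bound(10**6, 1e-3, 100.0)
```

With these choices the exponent simplifies to $-(3/28)\,n g/M$ with $M = \sqrt{d_n/(2\pi)}$, so on a region where $n g > n^{\epsilon'}$ the bound decays faster than any power of $n$, provided $d_n$ grows slowly enough relative to $n^{2\epsilon'}$.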

It remains to show how to modify the proof of (45) in order to conclude
$E_\mu \sum_i (\delta_v^{*\mu}(Y_i) - \hat\delta_v(Y_i))^2 = o(n^{\epsilon})$ for every $\epsilon > 0$. We briefly explain it in the
following.

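Let $\hat\delta_v^{(i)}$ be our estimator for $\delta_v^{*\mu}$ based on $Y_j$, $j = 1, \ldots, n$, $j \ne i$. We now
write

(57) $E_\mu \sum_i (\delta_v^{*\mu}(Y_i) - \hat\delta_v(Y_i))^2 = E_\mu \sum_i (\delta_v^{*\mu}(Y_i) - \hat\delta_v^{(i)}(Y_i) + \hat\delta_v^{(i)}(Y_i) - \hat\delta_v(Y_i))^2$.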
We now show that

(58) $E_\mu \sum_i (\hat\delta_v(Y_i) - \hat\delta_v^{(i)}(Y_i))^2 = o(n^{\epsilon})$

for every $\epsilon > 0$. This follows by arguments similar to the ones presented in
the first part of our lemma. Specifically, first note that

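(59) $E_\mu \sum_i (\hat\delta_v(Y_i) - \hat\delta_v^{(i)}(Y_i))^2 = E_\mu \sum_i (\hat\delta_v(Y_i) - \hat\delta_v^{(i)}(Y_i))^2 \times I(Y_i \in \mathcal{R}) + o(n^{\epsilon})$.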



na*(Fj) 

Now, taking a dense enough grid in the region $\mathcal{R}$ [note the derivative of $g^*_{G,1}$
in that region is bounded by $O(C_n)$], and applying Bernstein's inequality
coupled with Bonferroni, yields



\ y eK9*(y) 2 / VraC2, 
By (59) and (60) and the truncation, we obtain 

o{n 6 ') . /■ 2 x o(n e ') „, 

The above involves interchanging summation and integration. 
We then note that 



^-.s n o(n e ') . F . /" 2 x o(n E ') 

^^ng*(Yi) Jk g*(y) 



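$E_\mu \sum_i (\delta_v^{*\mu}(Y_i) - \hat\delta_v^{(i)}(Y_i))^2 = E_\mu \sum_i \int (\delta_v^{*\mu}(y) - \hat\delta_v^{(i)}(y))^2 \phi(y - \mu_i)\,dy$

(62) $= E_\mu \sum_i \int (\delta_v^{*\mu}(y) - \hat\delta_v(y) + \hat\delta_v(y) - \hat\delta_v^{(i)}(y))^2 \phi(y - \mu_i)\,dy$

$= o(n^{\epsilon'})$

for every $\epsilon > 0$.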

Obtaining the last equality involves evaluating

$\sum_i \int E(\delta_v^{*\mu}(y) - \hat\delta_v(y))^2 \phi(y - \mu_i)\,dy$

as in (46), and

$\sum_i \int E(\hat\delta_v(y) - \hat\delta_v^{(i)}(y))^2 \phi(y - \mu_i)\,dy$

similarly to (58). This completes the proof. □